human review
AI-assisted workflow enables rapid, high-fidelity breast cancer clinical trial eligibility prescreening
Rosenthal, Jacob T., Hahesy, Emma, Chalise, Sulov, Zhu, Menglei, Sabuncu, Mert R., Braunstein, Lior Z., Li, Anyi
Clinical trials play an important role in cancer care and research, yet participation rates remain low. We developed MSK-MATCH (Memorial Sloan Kettering Multi-Agent Trial Coordination Hub), an AI system for automated eligibility screening from clinical text. MSK-MATCH integrates a large language model with a curated oncology trial knowledge base and a retrieval-augmented architecture that grounds explanations for every AI prediction in the source text. In a retrospective dataset of 88,518 clinical documents from 731 patients across six breast cancer trials, MSK-MATCH automatically resolved 61.9% of cases and triaged 38.1% for human review. This AI-assisted workflow achieved 98.6% accuracy, 98.4% sensitivity, and 98.7% specificity for patient-level eligibility classification, matching or exceeding the performance of the human-only and AI-only comparisons. For the triaged cases requiring manual review, prepopulating eligibility screens with AI-generated explanations reduced screening time from 20 minutes to 43 seconds, at an average cost of $0.96 per patient-trial pair.
- North America > United States > Florida > Miami-Dade County > Miami (0.04)
- North America > Canada (0.04)
- Europe > Italy > Tuscany > Pisa Province > Pisa (0.04)
- Research Report > New Finding (1.00)
- Research Report > Experimental Study (1.00)
- Health & Medicine > Therapeutic Area > Obstetrics/Gynecology (1.00)
- Health & Medicine > Therapeutic Area > Oncology > Breast Cancer (0.71)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Agents (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.47)
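The triage pattern the MSK-MATCH abstract describes — auto-resolve confident predictions, route the rest to a human with the AI's grounded explanation prefilled — can be sketched as follows. This is a hypothetical illustration; the data class, threshold, and function names are assumptions, not MSK-MATCH internals.

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    eligible: bool      # model's eligibility call for a patient-trial pair
    confidence: float   # model's confidence in that call, 0..1
    evidence: str       # explanation grounded in the source clinical text

def triage(pred: Prediction, threshold: float = 0.9):
    """Auto-resolve high-confidence predictions; flag the rest for review."""
    if pred.confidence >= threshold:
        return ("auto_resolved", pred.eligible, pred.evidence)
    # Low confidence: prepopulate the human review screen with the
    # AI-generated explanation so the reviewer starts from the evidence.
    return ("human_review", None, pred.evidence)
```

The reported time savings come from the second branch: the reviewer verifies a prefilled, source-grounded explanation instead of screening the chart from scratch.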
REMOR: Automated Peer Review Generation with LLM Reasoning and Multi-Objective Reinforcement Learning
Taechoyotin, Pawin, Acuna, Daniel
AI-based peer review systems tend to produce shallow and overpraising suggestions compared to human feedback. Here, we evaluate how well a reasoning LLM trained with multi-objective reinforcement learning (REMOR) can overcome these limitations. We start by designing a multi-aspect reward function that aligns with human evaluation of reviews. The aspects are related to the review itself (e.g., criticisms, novelty) and the relationship between the review and the manuscript (i.e., relevance). First, we perform supervised fine-tuning of DeepSeek-R1-Distill-Qwen-7B using LoRA on PeerRT, a new dataset of high-quality top AI conference reviews enriched with reasoning traces. We then apply Group Relative Policy Optimization (GRPO) to train two models: REMOR-H (with the human-aligned reward) and REMOR-U (with a uniform reward). Interestingly, the human-aligned reward penalizes aspects typically associated with strong reviews, leading REMOR-U to produce qualitatively more substantive feedback. Our results show that REMOR-U and REMOR-H achieve more than twice the average rewards of human reviews, non-reasoning state-of-the-art agentic multi-modal AI review systems, and general commercial LLM baselines. We found that while the best AI and human reviews are comparable in quality, REMOR avoids the long tail of low-quality human reviews. We discuss how reasoning is key to achieving these improvements and release the Human-aligned Peer Review Reward (HPRR) function, the Peer Review Reasoning-enriched Traces (PeerRT) dataset, and the REMOR models, which we believe can help spur progress in the area.
- North America > United States > Colorado > Boulder County > Boulder (0.14)
- Europe > Austria > Salzburg > Salzburg (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.49)
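The two ingredients the REMOR abstract names — a multi-aspect reward and Group Relative Policy Optimization — can be sketched minimally. The aspect names and weights here are illustrative assumptions; the second function shows only GRPO's core idea of standardizing each sampled output's reward against its group, not the full training loop.

```python
import statistics

def aspect_reward(scores: dict, weights: dict) -> float:
    """Weighted sum over review aspects (e.g. criticism, novelty, relevance)."""
    return sum(weights[k] * scores[k] for k in weights)

def grpo_advantages(rewards: list) -> list:
    """Group-relative advantages: each sampled review's reward is
    standardized against the mean/std of its sampling group."""
    mu = statistics.mean(rewards)
    sigma = statistics.stdev(rewards)
    return [(r - mu) / sigma for r in rewards]
```

Under this scheme, a uniform-weight reward (REMOR-U) and a human-aligned weighting (REMOR-H) differ only in the `weights` dictionary passed to `aspect_reward`.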
The AI Imperative: Scaling High-Quality Peer Review in Machine Learning
Wei, Qiyao, Holt, Samuel, Yang, Jing, Wulfmeier, Markus, van der Schaar, Mihaela
Peer review, the bedrock of scientific advancement in machine learning (ML), is strained by a crisis of scale. Exponential growth in manuscript submissions to premier ML venues such as NeurIPS, ICML, and ICLR is outpacing the finite capacity of qualified reviewers, leading to concerns about review quality, consistency, and reviewer fatigue. This position paper argues that AI-assisted peer review must become an urgent research and infrastructure priority. We advocate for a comprehensive AI-augmented ecosystem, leveraging Large Language Models (LLMs) not as replacements for human judgment, but as sophisticated collaborators for authors, reviewers, and Area Chairs (ACs). We propose specific roles for AI in enhancing factual verification, guiding reviewer performance, assisting authors in quality improvement, and supporting ACs in decision-making. Crucially, we contend that the development of such systems hinges on access to more granular, structured, and ethically-sourced peer review process data. We outline a research agenda, including illustrative experiments, to develop and validate these AI assistants, and discuss significant technical and ethical challenges. We call upon the ML community to proactively build this AI-assisted future, ensuring the continued integrity and scalability of scientific validation, while maintaining high standards of peer review.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.40)
- North America > United States > California (0.14)
- Research Report > Experimental Study (1.00)
- Overview (1.00)
- Education (1.00)
- Health & Medicine > Therapeutic Area (0.46)
Abstract2Appendix: Academic Reviews Enhance LLM Long-Context Capabilities
Li, Shengzhi, Kampa, Kittipat, Lin, Rongyu, Li, Bohang, Pei, Shichao
Large language models (LLMs) have shown remarkable performance across various tasks, yet their ability to handle long-context reading remains challenging. This study explores the effectiveness of leveraging high-quality academic peer review data for fine-tuning LLMs to enhance their long-context capabilities. We compare the Direct Preference Optimization (DPO) method with the Supervised Fine-Tuning (SFT) method, demonstrating DPO's superiority and data efficiency. Our experiments show that the fine-tuned model achieves a 4.04-point improvement over phi-3 and a 2.6% increase on the Qasper benchmark using only 2000 samples. Despite facing limitations in data scale and processing costs, this study underscores the potential of DPO and high-quality data in advancing LLM performance. Additionally, the zero-shot benchmark results indicate that aggregated high-quality human reviews are overwhelmingly preferred over LLM-generated responses, even for the most capable models like GPT-4o. This suggests that high-quality human reviews are extremely rich in information, reasoning, and long-context retrieval, capabilities that even the most advanced models have not fully captured. These findings highlight the high utility of leveraging human reviews to further advance the field.
- North America > United States > Massachusetts > Suffolk County > Boston (0.14)
- Asia > Middle East > Jordan (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- (10 more...)
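The DPO objective the Abstract2Appendix abstract contrasts with SFT can be shown for a single preference pair. This is a standard textbook form of the loss, not code from the paper; the log-probabilities would come from the policy and a frozen reference model.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Direct Preference Optimization loss for one (chosen, rejected) pair.

    The margin compares how much the policy has shifted toward the chosen
    response relative to the reference model; the loss is -log(sigmoid(margin)).
    """
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))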
Generation, Distillation and Evaluation of Motivational Interviewing-Style Reflections with a Foundational Language Model
Brown, Andrew, Zhu, Jiading, Abdelwahab, Mohamed, Dong, Alec, Wang, Cindy, Rose, Jonathan
Large Foundational Language Models are capable of performing many tasks at a high level but are difficult to deploy in many applications because of their size and proprietary ownership. Many will be motivated to distill specific capabilities of foundational models into smaller models that can be owned and controlled. In the development of a therapeutic chatbot, we wish to distill a capability known as reflective listening, in which a therapist produces reflections of client speech. These reflections either restate what a client has said, or connect what was said to a relevant observation, idea or guess that encourages and guides the client to continue contemplation. In this paper, we present a method for distilling the generation of reflections from a Foundational Language Model (GPT-4) into smaller models. We first show that GPT-4, using zero-shot prompting, can generate reflections at a near-100% success rate, superior to all previous methods. Using reflections generated by GPT-4, we fine-tune different sizes of the GPT-2 family. The GPT-2-small model achieves 83% success on a hold-out test set and the GPT-2 XL achieves 90% success. We also show that GPT-4 can help in the labor-intensive task of evaluating the quality of the distilled models, using it as a zero-shot classifier. Using triple-human review as a guide, the classifier achieves a Cohen's kappa of 0.66, a substantial inter-rater reliability figure.
- North America > Canada > Ontario > Toronto (0.14)
- Asia > Middle East > UAE > Abu Dhabi Emirate > Abu Dhabi (0.04)
- Health & Medicine > Consumer Health (0.47)
- Health & Medicine > Therapeutic Area > Psychiatry/Psychology (0.46)
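The agreement figure the distillation abstract reports (Cohen's kappa of 0.66 between the GPT-4 classifier and human review) is a standard chance-corrected metric; a minimal sketch for two binary raters follows. The variable names are illustrative; real evaluations typically use a library implementation such as scikit-learn's `cohen_kappa_score`.

```python
def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two binary (0/1) raters over the same items."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    pa = (sum(a) / n) * (sum(b) / n) \
       + ((n - sum(a)) / n) * ((n - sum(b)) / n)         # chance agreement
    return (po - pa) / (1 - pa)
```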
How to Stop Google Bard From Storing Your Data and Location
With its most recent update, Google Bard can now sort through your trove of Google Docs, rediscover ancient Gmail messages, and search through every video on YouTube. Before experimenting too much with the new extensions available for Google's chatbot, it's worth going over the steps you can take to protect your privacy (and the ones you can't). Google Bard launched in March of this year, one month after OpenAI released ChatGPT to the public. You're likely familiar with how chatbots are designed to mimic human conversation, but Google's latest features are designed to give Bard more practical applications and uses. But when every conversation you have with Bard is tracked, logged, and used again to train the AI, how can you trust it with your data?
Police departments across America using AI to analyze officers' bodycam video
A company known as Truleo uses AI to process bodycam footage so law enforcement agencies can review their officers' behavior and actions on a daily basis. Law enforcement agencies are using artificial intelligence to analyze body camera video in an effort to improve trust and transparency in communities nationwide. Truleo automatically detects critical situations from body camera footage that involve use-of-force, pursuits and frisking. The AI platform also screens for both professional and unprofessional language. This automated analysis is readily available to supervisors within minutes so they can evaluate officers' conduct.
- North America > United States > Washington > King County > Seattle (0.06)
- North America > United States > Colorado (0.06)
- North America > United States > California (0.06)
- North America > United States > Alabama (0.06)
GPT4 is Slightly Helpful for Peer-Review Assistance: A Pilot Study
In this pilot study, we investigate the use of GPT4 to assist in the peer-review process. Our key hypothesis was that GPT-generated reviews could achieve comparable helpfulness to human reviewers. By comparing reviews generated by both human reviewers and GPT models for academic papers submitted to a major machine learning conference, we provide initial evidence that artificial intelligence can contribute effectively to the peer-review process. We also perform robustness experiments with inserted errors to understand which parts of the paper the model tends to focus on. Our findings open new avenues for leveraging machine learning tools to address resource constraints in peer review. The results also shed light on potential enhancements to the review process and lay the groundwork for further research on scaling oversight in a domain where human feedback is increasingly a scarce resource.
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Canada > Alberta > Census Division No. 13 > Lac Ste. Anne County (0.04)
- Asia (0.04)
Generative AI: Implications and Applications for Education
Tzirides, Anastasia Olga, Saini, Akash, Zapata, Gabriela, Searsmith, Duane, Cope, Bill, Kalantzis, Mary, Castro, Vania, Kourkoulou, Theodora, Jones, John, da Silva, Rodrigo Abrantes, Whiting, Jen, Kastania, Nikoleta Polyxeni
The launch of ChatGPT in November 2022 precipitated a panic among some educators while prompting qualified enthusiasm from others. Under the umbrella term Generative AI, ChatGPT is an example of a range of technologies for the delivery of computer-generated text, image, and other digitized media. This paper examines the implications for education of one generative AI technology, chatbots responding from large language models, or C-LLM. It reports on an application of a C-LLM to AI review and assessment of complex student work. In a concluding discussion, the paper explores the intrinsic limits of generative AI, bound as it is to language corpora and their textual representation through binary notation. Within these limits, we suggest the range of emerging and potential applications of Generative AI in education.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Europe > United Kingdom > England > Greater London > London (0.04)
- (10 more...)
- Research Report (1.00)
- Instructional Material > Course Syllabus & Notes (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Chatbot (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning > Generative AI (1.00)
Does Google Hate AI Content? Not if You Follow E-E-A-T - Digital Purview
Google has traditionally been against automatically generated content (AGC), stating that it violates its webmaster guidelines and falls into the category of spam. Given that AI-written content is also auto-generated, does Google consider it spam that violates those guidelines? Has that position changed over time? According to the latest updates, it seems it has. Let's explore Google's current position, and how it has shifted over time, in this article.